The dataset being analyzed comprises four omics data types: drug response data, DNA methylation, RNA-seq, and somatic mutations, derived from blood samples of 200 patients with Chronic Lymphocytic Leukemia (CLL). Nearly 40% of the 200 samples were profiled with some but not all omics types. Sample metadata includes gender, age, time to next treatment (TTT), time to death (TTD), and binary indicators for whether a patient received treatment (treatedAfter) or passed away (died).
The four omics data types in this dataset were processed as follows:
The data is structured as a list of matrices, with features in rows and samples in columns.
# Load libraries
library(MOFA2)
library(MOFAdata)
library(iClusterPlus)
library(data.table)
library(ggplot2)
library(tidyverse)
library(lattice)
library(pheatmap)
library(RColorBrewer)
library(gplots)
library(survival)
library(survminer)
library(stats)
library(graphics)
# Load data
utils::data("CLL_data")
lapply(CLL_data, dim)
## $Drugs
## [1] 310 200
##
## $Methylation
## [1] 4248 200
##
## $mRNA
## [1] 5000 200
##
## $Mutations
## [1] 69 200
# Load sample metadata
CLL_metadata <- fread("ftp://ftp.ebi.ac.uk/pub/databases/mofa/cll_vignette/sample_metadata.txt")
The MOFA model takes m data matrices as input (\(Y^1\),. . ., \(Y^m\)), and decomposes these matrices into a matrix of factors (\(Z\)) for each sample and weight matrices, one for each data modality (\(W^1\),.., \(W^m\)). White cells in the weight matrices correspond to zeros, i.e. inactive features, whereas the cross symbol in the data matrices denotes missing values.
# Set seed for repeatability
set.seed(1234)
# Create the MOFA object
MOFAobject <- create_mofa(CLL_data)
MOFAobject
## Untrained MOFA model with the following characteristics:
## Number of views: 4
## Views names: Drugs Methylation mRNA Mutations
## Number of features (per view): 310 4248 5000 69
## Number of groups: 1
## Groups names: group1
## Number of samples (per group): 200
##
# Data options
data_opts <- get_default_data_options(MOFAobject)
# Model options
model_opts <- get_default_model_options(MOFAobject)
model_opts$num_factors <- 15
# Training options
train_opts <- get_default_training_options(MOFAobject)
train_opts$convergence_mode <- "slow"
train_opts$seed <- 42
# Prepare the MOFA object
MOFAobject <- prepare_mofa(MOFAobject, data_options = data_opts,
model_options = model_opts, training_options = train_opts )
# Train the MOFA model
f <- "MOFAobject_CLL.rds"
if (file.exists(f)) {
load(f)
} else {
MOFAobject <- run_mofa(MOFAobject, use_basilisk=TRUE)
save(MOFAobject,file=f)
}
# Add sample metadata to the MOFA object
samples_metadata(MOFAobject) <- CLL_metadata
MOFA2 offers a data overview function to visualize key data set characteristics, including the number of views (rows), number of samples (columns), feature dimensions, and missing data (represented by grey bars).
plot_data_overview(MOFAobject)
plot_variance_explained(MOFAobject, max_r2=15)
This plot illustrates the percentage of variability explained by each factor across different data views.
plot_variance_explained(MOFAobject, plot_total = T)[[2]]
This plot illustrates the total variability explained by all factors. The explained variability depends on factors such as the number of samples and the number of factors. In this dataset, the MOFA model with 15 factors accounts for approximately 51% of the variability in drug response data and 43% in mRNA data.
plot_factor_cor(MOFAobject)
This plot displays the correlations between factor scores, showing weak correlations among the 15 factors.
Features with no association with a given factor are expected to have weights close to zero, whereas features with strong association are expected to have large absolute weights. The sign of the weights indicates the direction of the effect: a positive weight suggests that the feature is more abundant in cells with positive factor values, whereas a negative weight indicates higher levels in cells with negative factor values.
plot_top_weights(MOFAobject, view = "Mutations",
factor = 1, nfeatures = 10, scale = T )
This plot displays the top 10 weights on Factor 1 for somatic mutation data, with IGHV having the highest positive weight on Factor 1.
plot_top_weights(MOFAobject, view = "Mutations",
factor = 3, nfeatures = 10, scale = T )
This plot presents the top 10 weights on Factor 3 for somatic mutation data, with trisomy 12 having the highest positive weight on Factor 3.
plot_factor(MOFAobject, factors = 1, color_by = "IGHV",
add_violin = TRUE, dodge = TRUE )
This plot displays Factor 1 scores, with samples colored by IGHV mutation status. Samples with positive Factor 1 scores have the IGHV mutation, while those with negative Factor 1 scores do not.
plot_factors(MOFAobject, factors = c(1,3),
color_by = "IGHV", shape_by = "trisomy12",
dot_size = 2.5, show_missing = T ) +
geom_hline(yintercept=1, linetype="dashed") +
geom_vline(xintercept=(-0.5), linetype="dashed")
This scatter plot displays Factor 1 versus Factor 3, with points colored by IGHV mutation status and shaped according to trisomy 12 mutation status. Samples with positive Factor 3 scores carry the trisomy 12 mutation, while those with negative Factor 3 scores do not.
Feature data (e.g., mRNA) can be explored using scatter plots of factor scores versus expression values for the top features with the largest positive and negative weights. Samples are colored based on IGHV mutation status.
plot_data_scatter(MOFAobject, view = "mRNA",
factor = 1, features = 4,
sign = "positive", color_by = "IGHV") +
labs(y="RNA expression")
# ESPNL, GLRA3, DPF3, ADAM29
plot_data_scatter(MOFAobject, view = "mRNA",
factor = 1, features = 4,
sign = "negative", color_by = "IGHV" ) +
labs(y="RNA expression")
# ZNF471, SEPTIN10, SOWAHC, ZNF667
A gene expression heatmap displaying the top 50 genes with the highest absolute weights on Factor 1
plot_data_heatmap(MOFAobject, view = "mRNA",
factor = 1, features = 50,
denoise = TRUE, cluster_rows = FALSE,
cluster_cols = FALSE, show_rownames = TRUE,
show_colnames = FALSE, scale = "row")
Gene set enrichment analysis can be performed to identify significant associations between factors and gene sets.
utils::data(reactomeGS)
# GSEA on positive weights
res.positive <- run_enrichment(MOFAobject, feature.sets = reactomeGS,
view = "mRNA", sign = "positive" )
# GSEA on negative weights
res.negative <- run_enrichment(MOFAobject, feature.sets = reactomeGS,
view = "mRNA", sign = "negative" )
plot_enrichment_heatmap(res.positive)
plot_enrichment_heatmap(res.negative)
plot_enrichment(res.positive, factor = 4, max.pathways = 15)
plot_enrichment(res.negative, factor = 5, max.pathways = 15)
The enrichment heatmaps provide an overview of significant pathways associated with each factor. Factor 4 shows strong enrichment for genes with positive weights, while Factor 5 exhibits strong enrichment for genes with negative weights. This suggests that Factor 4 primarily captures variability in the immune response, whereas Factor 5 reflects variability in the stress response of blood cells.
set.seed(123)
# UMAP
umap <- run_umap(MOFAobject)
plot_dimred(umap, method='UMAP', color_by='IGHV',
show_missing=FALSE, dot_siz=2.5)
# T-SNE
tsne <- run_tsne(MOFAobject)
plot_dimred(tsne, method='TSNE', color_by='IGHV',
show_missing=FALSE, dot_siz=2.5)
Samples are visualized on dimension reduction plots (UMAP and t-SNE) based on factor scores and colored according to IGHV mutation status.
A heatmap with two-way clustering of factor scores is presented, along with selected sample metadata for additional context.
colbar <- data.frame(IGHV=MOFAobject@samples_metadata$IGHV,
trisomy12=MOFAobject@samples_metadata$trisomy12,
treated=MOFAobject@samples_metadata$treatedAfter*1,
died=MOFAobject@samples_metadata$died*1,
gender=MOFAobject@samples_metadata$Gender)
rownames(colbar) <- MOFAobject@samples_metadata$sample
Z <- get_factors(MOFAobject)[[1]]
Z.image <- Z
Z.image[Z.image>3]=3
Z.image[Z.image< -3]= -3
pheatmap(t(Z.image), color=bluered(256), scale="none",
cell_width=NA, cellheight=10, border_color=NA,
cluster_rows = FALSE, show_rownames = TRUE,
show_colnames = FALSE, cluster_cols = TRUE,
clustering_distance_cols = "correlation",
clustering_distance_rows = "correlation",
clustering_method = "ward.D2",
annotation_colors = NA, annotation_col = colbar,
fontsize_row = 8, fontsize_col = 8, fontsize = 5)
The factors can be linked to clinical outcomes (time to treatment) using the Cox model.
SurvObject <- Surv(MOFAobject@samples_metadata$TTT, MOFAobject@samples_metadata$treatedAfter)
fit <- coxph(SurvObject ~ Z)
fit
## Call:
## coxph(formula = SurvObject ~ Z)
##
## coef exp(coef) se(coef) z p
## ZFactor1 -0.51825 0.59556 0.10435 -4.967 6.81e-07
## ZFactor2 0.58432 1.79377 0.16157 3.616 0.000299
## ZFactor3 0.09219 1.09657 0.12475 0.739 0.459916
## ZFactor4 -0.01270 0.98738 0.09899 -0.128 0.897891
## ZFactor5 -0.55458 0.57431 0.12501 -4.436 9.16e-06
## ZFactor6 0.24545 1.27819 0.13664 1.796 0.072454
## ZFactor7 -0.62514 0.53518 0.12516 -4.995 5.89e-07
## ZFactor8 0.36344 1.43827 0.11575 3.140 0.001691
## ZFactor9 0.07399 1.07680 0.10567 0.700 0.483813
## ZFactor10 -0.54399 0.58043 0.12328 -4.412 1.02e-05
## ZFactor11 0.18830 1.20720 0.10883 1.730 0.083589
## ZFactor12 0.03170 1.03221 0.12825 0.247 0.804765
## ZFactor13 0.44991 1.56817 0.14166 3.176 0.001493
## ZFactor14 -0.04036 0.96045 0.13232 -0.305 0.760362
## ZFactor15 0.04323 1.04418 0.08830 0.490 0.624408
##
## Likelihood ratio test=108.7 on 15 df, p=2.927e-16
## n= 174, number of events= 96
## (26 observations deleted due to missingness)
s <- summary(fit)
coef <- s[["coefficients"]]
df <- data.frame(factor = factor(rownames(coef), levels = rev(rownames(coef))),
p = coef[,"Pr(>|z|)"], coef = coef[,"exp(coef)"],
lower = s[["conf.int"]][,"lower .95"],
higher = s[["conf.int"]][,"upper .95"])
ggplot(df, aes(x=factor, y=coef, ymin=lower, ymax=higher)) +
geom_pointrange( col='#619CFF') + coord_flip() +
scale_x_discrete() + labs(y="Hazard Ratio", x="") +
geom_hline(aes(yintercept=1), linetype="dotted") +
theme_bw()
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], Z1 = Z[,1])
cut <- surv_cutpoint(df, variables='Z1')
df$FactorCluster <- df$Z1 > cut$cutpoint$cutpoint
fit <- survfit(Surv(time, event) ~ FactorCluster, df)
ggsurvplot(fit, data = df, conf.int = TRUE, pval = TRUE, legend = "top",
legend.labs = c(paste("Low Z1"), paste("High Z1")),
xlab = "Time to treatment", ylab="Survival probability",
title= "Factor 1")$plot
Several factors are significantly associated with time to treatment. When samples are divided into two groups based on Factor 1, a significant difference in survival is observed between the groups.
Samples were clustered using K-means based on factor scores, and the resulting clusters were analyzed in relation to time to treatment by Kaplan-Meier curves. The survival curves show clear separation across different sample clusters.
set.seed(123)
clusters.MOFA <- cbind(K1=kmeans(Z,2)$cluster,
K2=kmeans(Z,3)$cluster,
K3=kmeans(Z,4)$cluster,
K4=kmeans(Z,5)$cluster,
K5=kmeans(Z,6)$cluster)
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], cluster = clusters.MOFA[,1])
fitSurvflt <- survfit(SurvObject ~ cluster, df)
ggsurvplot(fitSurvflt, pval = TRUE, legend = "none",
xlab = "Time to treatment", ylab="Survival probability",
title= "Two Clusters")$plot
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], cluster = clusters.MOFA[,2])
fitSurvflt <- survfit(SurvObject ~ cluster, df)
ggsurvplot(fitSurvflt, pval = TRUE, legend = "none",
xlab = "Time to treatment", ylab="Survival probability",
title= "Three Clusters")$plot
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], cluster = clusters.MOFA[,3])
fitSurvflt <- survfit(SurvObject ~ cluster, df)
ggsurvplot(fitSurvflt, pval = TRUE, legend = "none",
xlab = "Time to treatment", ylab="Survival probability",
title= "Four Clusters")$plot
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], cluster = clusters.MOFA[,4])
fitSurvflt <- survfit(SurvObject ~ cluster, df)
ggsurvplot(fitSurvflt, pval = TRUE, legend = "none",
xlab = "Time to treatment", ylab="Survival probability",
title= "Five Clusters")$plot
The iCluster model takes multiple-omics data sets (A) to identify common latent variables that can be used to cluster patient samples in a lower dimensional integrated latent variable space (B). Simultaneously, driver omics features (e.g., gene expression, DNA methylation, and somatic mutation) that contribute to sample clustering are identified (C).
The R iClusterPlus package requires a complete dataset without missing values. Therefore, we imputed the missing data using MOFA2.
MOFAobject <- impute(MOFAobject)
imputed_data <- get_imputed_data(MOFAobject)
CLL_imputed <- list(Drugs=t(imputed_data$Drugs$group1),
Methylation=t(imputed_data$Methylation$group1),
mRNA=t(imputed_data$mRNA$group1),
Mutations=t(imputed_data$Mutations$group1))
iClusterPlus does not produce results using the imputed data from all four omics data types. Therefore, we use only the imputed methylation and mRNA data for iClusterPlus analysis.
iClusterPlus fits a regularized latent variable model-based clustering approach, generating an integrated cluster assignment through joint inference across multiple data types. Below is a one-liner to run iClusterPlus, given the desired number of k (where the number of clusters is k+1) and a specified set of lambda parameter values. The optimal k and lambda must be determined through model tuning. If the number of clusters in the samples is known, we can directly choose the corresponding k (the number of latent variables) for cluster analysis. If the cluster number is unknown, we can explore different values of k, testing from 1 to N (a reasonable range of clusters). In this example, we set k from 1 to 5. Note: This process may take several days to complete when running on a typical laptop.
f <- file.path("./", "cv.fit.k1.Rdata")
if (file.exists(f)) {
output2=alist()
files=grep("cv.fit",dir())
for(i in 1:length(files)) {
load(dir()[files[i]])
output2[[i]]=cv.fit
}
} else {
set.seed(123)
for(k in 1:5) {
print(k)
cv.fit = tune.iClusterPlus(cpus=1,dt1=CLL_imputed[[2]],dt2=CLL_imputed[[3]],
type=c("gaussian","gaussian"), K=k, n.lambda=NULL,
scale.lambda=c(1,1), maxiter=20)
save(cv.fit,file=paste("cv.fit.k",k,".Rdata",sep=""))
}
}
nLambda = nrow(output2[[1]]$lambda)
nK = length(output2)
BIC = getBIC(output2)
devR = getDevR(output2)
minBICid = apply(BIC,2,which.min)
devRatMinBIC = rep(NA,nK)
for(i in 1:nK){
devRatMinBIC[i] = devR[minBICid[i],i]
}
For each k, we use the Bayesian Information Criterion (BIC) to identify the best sparse model with the optimal combination of penalty parameters. To determine the optimal k, we compute the deviance ratio. This deviance ratio can be interpreted as the percentage of explained variance. We select k where the percentage curve stabilizes, indicating the optimal number of clusters.
clusters=getClusters(output2)
rownames(clusters)=rownames(CLL_imputed[[2]])
colnames(clusters)=paste("K=",2:(length(output2)+1),sep="")
plot(1:(nK+1),c(0,devRatMinBIC),type="b",xlab="Number of clusters (K+1)",
ylab="% Explained Variation")
For noisy data, the deviance ratio generally increases as k increases. In this case, we generated heatmaps of the datasets to visually assess clustering patterns. The optimal number of clusters should be chosen based on the observed structure and feature patterns in the heatmaps.
Heatmap for two sample clusters:
col.scheme=alist()
col.scheme[[1]]=bluered(256)
col.scheme[[2]]=bluered(256)
meth.image=CLL_imputed[[2]]
meth.image[meth.image>3]=3
meth.image[meth.image< -3]= -3
exp.image=scale(CLL_imputed[[3]])
exp.image[exp.image>3]=3
exp.image[exp.image< -3]= -3
best.fit=output2[[1]]$fit[[which.min(BIC[,1])]]
plotHeatmap(fit=best.fit, datasets=list(meth.image,exp.image),
type=c("gaussian","gaussian"), col.scheme = col.scheme,
row.order=c(T,T),sparse=c(T,T),cap=c(T,T))
Heatmap for three sample clusters:
best.fit=output2[[2]]$fit[[which.min(BIC[,2])]]
plotHeatmap(fit=best.fit, datasets=list(meth.image,exp.image),
type=c("gaussian","gaussian"), col.scheme = col.scheme,
row.order=c(T,T),sparse=c(T,T),cap=c(T,T))
Heatmap for four sample clusters:
best.fit=output2[[3]]$fit[[which.min(BIC[,3])]]
plotHeatmap(fit=best.fit, datasets=list(meth.image,exp.image),
type=c("gaussian","gaussian"), col.scheme = col.scheme,
row.order=c(T,T),sparse=c(T,T),cap=c(T,T))
Heatmap for five sample clusters:
best.fit=output2[[4]]$fit[[which.min(BIC[,4])]]
plotHeatmap(fit=best.fit, datasets=list(meth.image,exp.image),
type=c("gaussian","gaussian"), col.scheme = col.scheme,
row.order=c(T,T),sparse=c(T,T),cap=c(T,T))
The resulting clusters by iClusterPlus were analyzed in relation to time to treatment by Kaplan-Meier curves. The survival curves show clear separation across different sample clusters.
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], cluster = clusters[,1])
fitSurvflt <- survfit(SurvObject ~ cluster, df)
ggsurvplot(fitSurvflt, pval = TRUE, legend = "none",
xlab = "Time to treatment", ylab="Survival probability",
title= "Two Clusters")$plot
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], cluster = clusters[,2])
fitSurvflt <- survfit(SurvObject ~ cluster, df)
ggsurvplot(fitSurvflt, pval = TRUE, legend = "none",
xlab = "Time to treatment", ylab="Survival probability",
title= "Three Clusters")$plot
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], cluster = clusters[,3])
fitSurvflt <- survfit(SurvObject ~ cluster, df)
ggsurvplot(fitSurvflt, pval = TRUE, legend = "none",
xlab = "Time to treatment", ylab="Survival probability",
title= "Four Clusters")$plot
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], cluster = clusters[,4])
fitSurvflt <- survfit(SurvObject ~ cluster, df)
ggsurvplot(fitSurvflt, pval = TRUE, legend = "none",
xlab = "Time to treatment", ylab="Survival probability",
title= "Five Clusters")$plot
iClusterBayes is a Bayesian integrative clustering method (Mo et al., 2018). Unlike iClusterPlus, this method eliminates the need for a grid search to determine the optimal lasso parameter, significantly reducing computational requirements, especially for large datasets.
f <- "Bayfit.Rda"
if (file.exists(f)) {
load(f)
} else {
set.seed(123)
bayfit = tune.iClusterBayes(cpus=1,dt1=CLL_imputed[[2]],dt2=CLL_imputed[[3]],
type=c("gaussian","gaussian"),K=1:6,
n.burnin=18000, n.draw=12000,
prior.gamma=c(0.5,0.5), sdev=0.05, thin=3)
save(bayfit,file=f)
}
allBIC = NULL
devratio = NULL
nK = length(bayfit$fit)
for(i in 1:nK){
allBIC = c(allBIC,bayfit$fit[[i]]$BIC)
devratio = c(devratio,bayfit$fit[[i]]$dev.ratio)
}
BIC or deviance ratio can be used as criteria to determine the optimal k, which defines the number of clusters. However, in noisy datasets, the deviance ratio may continue to increase, and the BIC may keep decreasing as k increases. In such cases, we suggest generating heatmaps of the datasets and selecting the optimal number of clusters based on the observed feature patterns.
clusters.Bayes <- cbind(K1=bayfit$fit[[1]]$clusters,
K2=bayfit$fit[[2]]$clusters,
K3=bayfit$fit[[3]]$clusters,
K4=bayfit$fit[[4]]$clusters,
K5=bayfit$fit[[5]]$clusters)
rownames(clusters.Bayes)=rownames(CLL_imputed[[2]])
par(mfrow=c(1,2))
plot(1:nK, allBIC,type="b",xlab="k",ylab="BIC",pch=c(1,1,1,1,1,1))
plot(1:nK,devratio,type="b",xlab="k",ylab="Deviance ratio",pch=c(1,1,1,1,1,1))
par(mfrow=c(1,1))
The posterior probability of features can be used as a criterion for selecting features that drive the integrative clustering.
Posteriors for two sample clusters:
par(mfrow=c(1,2))
plot(bayfit$fit[[1]]$beta.pp[[1]],xlab="Genes",ylab="Posterior probability",
main="Methylation")
plot(bayfit$fit[[1]]$beta.pp[[2]],xlab="Genes",ylab="Posterior probability",
main="Expression")
par(mfrow=c(1,1))
Posteriors for three sample clusters:
par(mfrow=c(1,2))
plot(bayfit$fit[[2]]$beta.pp[[1]],xlab="Genes",ylab="Posterior probability",
main="Methylation")
plot(bayfit$fit[[2]]$beta.pp[[2]],xlab="Genes",ylab="Posterior probability",
main="Expression")
par(mfrow=c(1,1))
Posteriors for four sample clusters:
par(mfrow=c(1,2))
plot(bayfit$fit[[3]]$beta.pp[[1]],xlab="Genes",ylab="Posterior probability",
main="Methylation")
plot(bayfit$fit[[3]]$beta.pp[[2]],xlab="Genes",ylab="Posterior probability",
main="Expression")
par(mfrow=c(1,1))
Posteriors for five sample clusters:
par(mfrow=c(1,2))
plot(bayfit$fit[[4]]$beta.pp[[1]],xlab="Genes",ylab="Posterior probability",
main="Methylation")
plot(bayfit$fit[[4]]$beta.pp[[2]],xlab="Genes",ylab="Posterior probability",
main="Expression")
par(mfrow=c(1,1))
Heatmap for two sample clusters:
col.scheme=alist()
col.scheme[[1]]=bluered(256)
col.scheme[[2]]=bluered(256)
plotHMBayes(fit=bayfit$fit[[1]],datasets=list(meth.image,exp.image),
type=c("gaussian","gaussian"), col.scheme = col.scheme,
threshold=c(0.5,0.5),row.order=c(F,T),plot.chr=c(T,T),
sparse=c(T,T),cap=c(T,T))
## 1 3921
## 2 2375
Heatmap for three sample clusters:
plotHMBayes(fit=bayfit$fit[[2]],datasets=list(meth.image,exp.image),
type=c("gaussian","gaussian"), col.scheme = col.scheme,
threshold=c(0.5,0.5),row.order=c(F,T),plot.chr=c(T,T),
sparse=c(T,T),cap=c(T,T))
## 1 4153
## 2 2785
Heatmap for four sample clusters:
plotHMBayes(fit=bayfit$fit[[3]],datasets=list(meth.image,exp.image),
type=c("gaussian","gaussian"), col.scheme = col.scheme,
threshold=c(0.5,0.5),row.order=c(F,T),plot.chr=c(T,T),
sparse=c(T,T),cap=c(T,T))
## 1 4220
## 2 2664
Heatmap for five sample clusters:
plotHMBayes(fit=bayfit$fit[[4]],datasets=list(meth.image,exp.image),
type=c("gaussian","gaussian"), col.scheme = col.scheme,
threshold=c(0.5,0.5),row.order=c(F,T),plot.chr=c(T,T),
sparse=c(T,T),cap=c(T,T))
## 1 4236
## 2 2698
The resulting clusters by iClusterBayes were analyzed in relation to time to treatment by Kaplan-Meier curves. The survival curves show clear separation across different sample clusters.
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], cluster = clusters.Bayes[,1])
fitSurvflt <- survfit(SurvObject ~ cluster, df)
ggsurvplot(fitSurvflt, pval = TRUE, legend = "none",
xlab = "Time to treatment", ylab="Survival probability",
title= "Two Clusters")$plot
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], cluster = clusters.Bayes[,2])
fitSurvflt <- survfit(SurvObject ~ cluster, df)
ggsurvplot(fitSurvflt, pval = TRUE, legend = "none",
xlab = "Time to treatment", ylab="Survival probability",
title= "Three Clusters")$plot
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], cluster = clusters.Bayes[,3])
fitSurvflt <- survfit(SurvObject ~ cluster, df)
ggsurvplot(fitSurvflt, pval = TRUE, legend = "none",
xlab = "Time to treatment", ylab="Survival probability",
title= "Four Clusters")$plot
df <- data.frame(time = SurvObject[,1], event = SurvObject[,2], cluster = clusters.Bayes[,4])
fitSurvflt <- survfit(SurvObject ~ cluster, df)
ggsurvplot(fitSurvflt, pval = TRUE, legend = "none",
xlab = "Time to treatment", ylab="Survival probability",
title= "Five Clusters")$plot
Now, let’s compare the sample clusters generated by the three methods. For two sample clusters, the methods produce similar results, indicating consistent clustering across approaches. However, for more than two sample clusters, the results become less consistent.
table(MOFA=clusters.MOFA[,1],iClusterPlus=clusters[,1])
## iClusterPlus
## MOFA 1 2
## 1 78 0
## 2 2 120
table(MOFA=clusters.MOFA[,1],iClusterBayes=clusters.Bayes[,1])
## iClusterBayes
## MOFA 1 2
## 1 78 0
## 2 11 111
table(iClusterPlus=clusters[,1],iClusterBayes=clusters.Bayes[,1])
## iClusterBayes
## iClusterPlus 1 2
## 1 80 0
## 2 9 111
table(MOFA=clusters.MOFA[,2],iClusterPlus=clusters[,2])
## iClusterPlus
## MOFA 1 2 3
## 1 32 0 45
## 2 5 73 0
## 3 13 1 31
table(MOFA=clusters.MOFA[,2],iClusterBayes=clusters.Bayes[,2])
## iClusterBayes
## MOFA 1 2 3
## 1 0 11 66
## 2 41 37 0
## 3 1 6 38
table(iClusterPlus=clusters[,2],iClusterBayes=clusters.Bayes[,2])
## iClusterBayes
## iClusterPlus 1 2 3
## 1 3 9 38
## 2 39 35 0
## 3 0 10 66
table(MOFA=clusters.MOFA[,3],iClusterPlus=clusters[,3])
## iClusterPlus
## MOFA 1 2 3 4
## 1 2 3 9 15
## 2 69 0 9 0
## 3 0 20 5 13
## 4 0 12 6 37
table(MOFA=clusters.MOFA[,3],iClusterBayes=clusters.Bayes[,3])
## iClusterBayes
## MOFA 1 2 3 4
## 1 28 0 0 1
## 2 0 0 37 41
## 3 12 26 0 0
## 4 11 43 0 1
table(iClusterPlus=clusters[,3],iClusterBayes=clusters.Bayes[,3])
## iClusterBayes
## iClusterPlus 1 2 3 4
## 1 1 0 32 38
## 2 12 23 0 0
## 3 15 5 5 4
## 4 23 41 0 1
table(MOFA=clusters.MOFA[,4],iClusterPlus=clusters[,4])
## iClusterPlus
## MOFA 1 2 3 4 5
## 1 18 0 59 1 0
## 2 1 2 0 6 0
## 3 0 7 0 11 31
## 4 3 6 3 6 9
## 5 0 11 0 9 17
table(MOFA=clusters.MOFA[,4],iClusterBayes=clusters.Bayes[,4])
## iClusterBayes
## MOFA 1 2 3 4 5
## 1 0 41 0 0 37
## 2 3 0 2 4 0
## 3 8 1 22 18 0
## 4 25 1 1 0 0
## 5 4 0 11 22 0
table(iClusterPlus=clusters[,4],iClusterBayes=clusters.Bayes[,4])
## iClusterBayes
## iClusterPlus 1 2 3 4 5
## 1 3 10 0 1 8
## 2 8 0 5 13 0
## 3 2 32 0 0 28
## 4 15 0 9 8 1
## 5 12 1 22 22 0